Number of Instances: red wine - 1599

Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data):
12 - quality (score between 0 and 10)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
summary(rw)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
# Count the non-missing cells; length() would return 1599 * 13 even if NAs were present
sum(!is.na(rw)) == (1599 * 13)
## [1] TRUE
# This means there are no NA values in this dataset
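A per-column count can make a missing-value check easier to read than a single total. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the wine data) and base R only.

```r
# Hypothetical toy data frame standing in for rw; one NA is planted on purpose
toy <- data.frame(
  alcohol = c(9.4, 9.8, NA, 10.5),
  pH      = c(3.51, 3.20, 3.26, 3.16)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE only when the whole data frame is free of NA values
no_missing <- sum(is.na(toy)) == 0
print(no_missing)
```

On the real data the same two calls would take `rw` in place of `toy`.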
table(rw$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
# Plot quality counts
ggplot(aes(x = quality), data = rw) +
  geom_histogram(binwidth = 0.5)
## Quality takes only the values 3, 4, 5, 6, 7 and 8 and is concentrated at 5 and 6
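To put a number on how concentrated the scores are, the counts from the `table(rw$quality)` output can be turned into proportions. This is a minimal sketch that retypes those counts by hand; the variable names are illustrative.

```r
# Counts copied from the table(rw$quality) output above
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Convert counts to proportions of all wines
quality_share <- quality_counts / sum(quality_counts)
print(round(quality_share, 3))

# Combined share of the two middle scores
share_5_or_6 <- sum(quality_share[c("5", "6")])
print(round(share_5_or_6, 3))
```

Scores 5 and 6 together account for about 82% of the wines. With the data frame loaded, `prop.table(table(rw$quality))` gives the same proportions directly.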
table(rw$alcohol)
##
## 8.4 8.5 8.7 8.8
## 2 1 2 2
## 9 9.05 9.1 9.2
## 30 1 23 72
## 9.23333333333333 9.25 9.3 9.4
## 1 1 59 103
## 9.5 9.55 9.56666666666667 9.6
## 139 2 1 59
## 9.7 9.8 9.9 9.95
## 54 78 49 1
## 10 10.0333333333333 10.1 10.2
## 67 2 47 46
## 10.3 10.4 10.5 10.55
## 33 41 67 2
## 10.6 10.7 10.75 10.8
## 28 27 1 42
## 10.9 11 11.0666666666667 11.1
## 49 59 1 27
## 11.2 11.3 11.4 11.5
## 36 32 32 30
## 11.6 11.7 11.8 11.9
## 15 23 29 20
## 11.95 12 12.1 12.2
## 1 21 13 12
## 12.3 12.4 12.5 12.6
## 12 13 21 6
## 12.7 12.8 12.9 13
## 9 17 9 6
## 13.1 13.2 13.3 13.4
## 2 1 3 3
## 13.5 13.5666666666667 13.6 14
## 1 1 4 7
## 14.9
## 1
# Plot alcohol counts
ggplot(aes(x = alcohol), data = rw) +
geom_histogram(binwidth = 0.1) +
coord_cartesian(xlim = c(9, 13))
table(rw$pH)
##
## 2.74 2.86 2.87 2.88 2.89 2.9 2.92 2.93 2.94 2.95 2.98 2.99 3 3.01 3.02
## 1 1 1 2 4 1 4 3 4 1 5 2 6 5 8
## 3.03 3.04 3.05 3.06 3.07 3.08 3.09 3.1 3.11 3.12 3.13 3.14 3.15 3.16 3.17
## 6 10 8 10 11 11 11 19 9 20 13 21 34 36 27
## 3.18 3.19 3.2 3.21 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29 3.3 3.31 3.32
## 30 25 39 36 39 32 29 26 53 35 42 46 57 39 45
## 3.33 3.34 3.35 3.36 3.37 3.38 3.39 3.4 3.41 3.42 3.43 3.44 3.45 3.46 3.47
## 37 43 39 56 37 48 48 37 34 33 17 29 20 22 21
## 3.48 3.49 3.5 3.51 3.52 3.53 3.54 3.55 3.56 3.57 3.58 3.59 3.6 3.61 3.62
## 19 10 14 15 18 17 16 8 11 10 10 8 7 8 4
## 3.63 3.66 3.67 3.68 3.69 3.7 3.71 3.72 3.74 3.75 3.78 3.85 3.9 4.01
## 3 4 3 5 4 1 4 3 1 1 2 1 2 2
# Plot pH counts
# fill is set outside aes() so the bars are actually drawn in red
ggplot(aes(x = pH), data = rw) +
  geom_histogram(fill = 'red', binwidth = 0.01)
# Most values fall between 2.8 and 3.8
# Plot sulphates counts; this feature has a long tail, so the x axis uses a log10 scale
ggplot(aes(x = sulphates), data = rw) +
  scale_x_log10() +
  geom_histogram(binwidth = 0.01)
## The plot below shows that this feature has a long tail and that most values lie in the interval c(0.3, 1.0)
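One way to see why a log scale helps with a long right tail is to compare the mean and median before and after the transform. The sketch below uses simulated log-normal values, not the wine data; the parameters (a median near 0.62 and `sdlog = 0.5`) and the seed are assumptions chosen only for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Simulated right-skewed values; NOT the wine data
x <- rlnorm(1000, meanlog = log(0.62), sdlog = 0.5)

# On the raw scale the long right tail pulls the mean above the median
gap_raw <- mean(x) - median(x)

# After a log10 transform the distribution is roughly symmetric,
# so the mean and the median nearly coincide
log_x <- log10(x)
gap_log <- mean(log_x) - median(log_x)

print(c(raw = gap_raw, log10 = gap_log))
```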
# Plot density counts
ggplot(aes(x = density), data = rw) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Density looks approximately normally distributed
# Plot total.sulfur.dioxide counts
ggplot(aes(x = total.sulfur.dioxide), data = rw) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Most red wines have a total.sulfur.dioxide value between 0 and 80.
# Plot chlorides counts
ggplot(aes(x = chlorides), data = rw) +
geom_histogram(binwidth = 0.001) +
coord_cartesian(xlim = c(0.03, 0.14))
## Most values lie in c(0.03, 0.14), and the distribution is close to normal
# Plot fixed.acidity counts
ggplot(aes(x = fixed.acidity), data = rw) +
geom_histogram(binwidth = 0.1)
# Plot volatile.acidity counts
ggplot(aes(x = volatile.acidity), data = rw) +
geom_histogram(binwidth = 0.01)
# Plot residual.sugar counts
ggplot(aes(x = residual.sugar), data = rw) +
  geom_histogram(binwidth = 0.1) +
  coord_cartesian(xlim = c(1.2, 3.2))
## Most residual.sugar values lie in c(1.2, 3.2), and the distribution looks roughly normal
ggplot(aes(x = citric.acid), data = rw) +
xlab("citric acid")+
geom_bar(colour = "black", fill = "#990066")
table(rw$citric.acid)
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
## The most frequent values are 0, 0.02, 0.24 and 0.49
The main feature of interest is quality. The other features can be used to determine quality, and the analysis below looks at which of them affect red wine quality positively and which negatively.
Likely candidates are residual.sugar, pH, sulphates, chlorides, density and alcohol.
Apart from some long-tailed features, the data is tidy. X is only a row index and carries no meaning (it is not a time variable). Some outliers also need to be removed, for example:
ggplot(aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide), data = rw) +
geom_point(aes(color = quality))
# Two wines have total.sulfur.dioxide above 200; remove these outliers
rw2 <- subset(rw, total.sulfur.dioxide < 200)
ggplot(aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide), data = rw2) +
geom_point(aes(color = quality))
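The cutoff of 200 was chosen by eye from the scatter plot. As a point of comparison, a rule-based threshold such as Tukey's 1.5 * IQR fence can be computed from the quartiles in the `summary(rw)` output. This is a different technique from the one used above, shown only as a sketch; the `values` vector is hypothetical.

```r
# Quartiles of total.sulfur.dioxide copied from the summary(rw) output above
q1 <- 22
q3 <- 62

# Tukey's rule: flag values more than 1.5 * IQR above the third quartile
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
print(upper_fence)

# Hypothetical values, used only to show the filter; 289 echoes the
# maximum reported by summary(rw)
values <- c(34, 67, 54, 60, 289)
kept <- values[values <= upper_fence]
print(kept)
```

The fence works out to 122, which is stricter than the cutoff of 200 and would drop more wines, so which threshold to use remains a judgment call.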
## Plot the relationship between free.sulfur.dioxide and total.sulfur.dioxide
ggplot(aes(y = free.sulfur.dioxide, x = total.sulfur.dioxide), data = rw2) +
  geom_point(aes(color = quality), alpha = 1/2) +
  stat_smooth(method = 'lm')
## free.sulfur.dioxide is correlated with total.sulfur.dioxide, so only one of them (total.sulfur.dioxide) needs to be considered in further analysis.
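The strength of a linear relationship like this one can be summarized with a Pearson correlation coefficient. The sketch below uses only the first ten rows printed by `str(rw)`, retyped by hand, so the resulting number is illustrative and is not the correlation for the full dataset.

```r
# First ten values of each column, copied from the str(rw) output above
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Pearson correlation coefficient, always between -1 and 1
r <- cor(free_so2, total_so2)
print(r)
```

On the full data the equivalent call would be `cor(rw2$free.sulfur.dioxide, rw2$total.sulfur.dioxide)`.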
## Plot the correlation between pH and quality
## Convert quality to numeric
rw2$quality <- as.numeric(rw2$quality)
str(rw2)
## 'data.frame': 1597 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num 5 5 5 6 5 5 5 7 7 5 ...
ggplot(aes(x = pH, y = quality), data = rw2) +
geom_point(aes(color = quality), alpha = 1/2) +
stat_smooth(method = 'lm')
## In the plot below, pH has a negative effect on red wine quality
# Plot relation about total.sulfur.dioxide and quality
ggplot(aes(x = total.sulfur.dioxide, y = quality), data = rw2) +
geom_point(color = I('#F79420'), alpha = 1/4) +
stat_smooth(method = 'lm')
## total.sulfur.dioxide has a negative effect on quality
ggplot(aes(x = residual.sugar , y = quality), data = rw2) +
geom_point(color = I('#F79420'), alpha = 1/4) +
stat_smooth(method = 'lm')
## residual.sugar appears to have no effect on wine quality
ggplot(aes(x = chlorides, y = quality), data = rw2) +
geom_point(color = I('#F79420'), alpha = 1/4) +
stat_smooth(method = 'lm')
## Based on the plot below, chlorides has a negative effect on wine quality
ggplot(aes(x = alcohol, y = quality), data = rw2) +
geom_point(color = I('#F79420'), alpha = 1/4) +
stat_smooth(method = 'lm')
## Based on the plot below, alcohol has a stronger positive effect on wine quality
ggplot(aes(x = sulphates, y = quality), data = rw2) +
geom_point(color = I('red'), alpha = 1/2) +
stat_smooth(method = 'lm')
## Based on the plot below, sulphates has a stronger positive effect on wine quality
ggplot(aes(x = volatile.acidity, y = quality), data = rw2) +
geom_point() +
scale_x_log10(breaks=seq(.1,1,.1)) +
xlab("log10(volatile.acidity)") +
geom_smooth(method="lm")
The plots above show the following:
1. chlorides, total.sulfur.dioxide and pH have a negative effect on red wine quality.
2. sulphates and alcohol have a positive effect on red wine quality.
3. residual.sugar appears to have no effect on wine quality.
4. free.sulfur.dioxide is correlated with total.sulfur.dioxide, so only one of them (total.sulfur.dioxide) should be considered in further analysis.
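The direction of each effect read off the plots can also be checked numerically by correlating every feature with quality. The helper `cor_with_quality` below is hypothetical, and the data is just the first ten rows shown by `str(rw)` for three columns, so the printed values are illustrative and should not be read as results for the full dataset.

```r
# First ten rows of three features and quality, copied from the str(rw) output
first_ten <- data.frame(
  pH        = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Hypothetical helper: Pearson correlation of every other column with the target
cor_with_quality <- function(df, target = "quality") {
  features <- setdiff(names(df), target)
  sapply(df[features], function(column) cor(column, df[[target]]))
}

# Sort so negative effects come first and positive effects last
effects <- sort(cor_with_quality(first_ten))
print(round(effects, 2))
```

Applied to the numeric columns of `rw2`, the same helper would rank all features at once.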
## Plot the effect of citric.acid on pH
## Citric acid occurs in small quantities and can add 'freshness' and flavor to wines; keep only values > 0.
ggplot(data = subset(rw2, citric.acid > 0), aes(x = citric.acid, y = pH)) +
geom_point() +
scale_x_log10() +
xlab("log10(citric.acid)") +
geom_smooth(method="lm")
The plot below shows that as citric.acid increases, pH decreases (the wine becomes more acidic), which matches what we would expect.
## Plot fixed.acidity affect the pH
ggplot(aes(x = fixed.acidity, y = pH), data = rw2) +
geom_point() +
#scale_x_log10() +
xlab("fixed.acidity") +
geom_smooth(method="lm")
fixed.acidity has a stronger relationship with pH than citric.acid does.
## Plot sulphates, alcohol and quality relations
ggplot(aes(y = sulphates, x = alcohol,
           color = quality), data = rw2) +
  geom_line() +
  # limit the sulphates axis to c(0.3, 1.2)
  scale_y_continuous(limits=c(0.3, 1.2)) +
  facet_wrap(~quality)
Part 2 showed that sulphates and alcohol both have a positive effect on wine quality. The plots above show this as well: high-quality wines have high alcohol and high sulphates.
## Plot chlorides, residual.sugar and quality relations
ggplot(aes(x = chlorides, y = residual.sugar), data = rw2) +
geom_point(size = 3, shape = 1) +
scale_x_continuous(limits=c(0.05, 0.2)) +
scale_y_continuous(limits=c(1, 8)) +
facet_wrap(~quality)
## Warning: Removed 106 rows containing missing values (geom_point).
## No obvious effect was found for these features
From the plots above, better quality red wine generally has high alcohol and high sulphate concentrations.
In this plot, neither salt (chlorides) nor residual.sugar appears to have any effect on wine quality.
## Select 80% of the rows as training data and use the rest as test data
training_data <- sample_frac(rw2, .8)
test_data <- rw2[!rw2$X %in% training_data$X, ]
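Because the split is random, each run produces a different training set and slightly different coefficients. A minimal sketch of a repeatable split in base R follows, assuming an arbitrary seed of 42 and a hypothetical stand-in data frame `toy`; it is an alternative to the `sample_frac()` call above, not a description of it.

```r
set.seed(42)  # arbitrary seed, an assumption; any fixed value works

# Hypothetical stand-in for rw2: an index column X and one feature
toy <- data.frame(X = 1:100, alcohol = seq(9, 13, length.out = 100))

# Draw 80% of the row positions for training, keep the rest for testing
train_rows <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
training_toy <- toy[train_rows, ]
test_toy <- toy[-train_rows, ]

print(nrow(training_toy))
print(nrow(test_toy))
```

Calling `set.seed()` before `sample_frac()` should have the same effect on the original code.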
## Build the models, adding one feature at a time
m1 <- lm(quality ~ alcohol, data = training_data)
m2 <- update(m1, ~ . + sulphates)
m3 <- update(m2, ~ . + total.sulfur.dioxide)
m4 <- update(m3, ~ . + chlorides)
m5 <- update(m4, ~ . + pH)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = training_data)
## m2: lm(formula = quality ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = quality ~ alcohol + sulphates + total.sulfur.dioxide,
## data = training_data)
## m4: lm(formula = quality ~ alcohol + sulphates + total.sulfur.dioxide +
## chlorides, data = training_data)
## m5: lm(formula = quality ~ alcohol + sulphates + total.sulfur.dioxide +
## chlorides + pH, data = training_data)
##
## ==============================================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------------------------------
## (Intercept) 1.854*** 1.416*** 1.712*** 2.037*** 4.217***
## (0.194) (0.198) (0.208) (0.214) (0.459)
## alcohol 0.363*** 0.352*** 0.333*** 0.304*** 0.322***
## (0.019) (0.018) (0.019) (0.019) (0.019)
## sulphates 0.835*** 0.881*** 1.164*** 1.060***
## (0.110) (0.110) (0.121) (0.121)
## total.sulfur.dioxide -0.003*** -0.003*** -0.003***
## (0.001) (0.001) (0.001)
## chlorides -2.343*** -2.718***
## (0.435) (0.436)
## pH -0.687***
## (0.128)
## ----------------------------------------------------------------------------------------------
## R-squared 0.231 0.264 0.275 0.292 0.307
## adj. R-squared 0.231 0.263 0.274 0.289 0.304
## sigma 0.706 0.691 0.686 0.679 0.671
## F 384.099 229.215 161.421 130.962 112.801
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1367.451 -1339.329 -1329.738 -1315.365 -1301.088
## Deviance 635.976 608.594 599.528 586.193 573.241
## AIC 2740.901 2686.658 2669.476 2642.730 2616.176
## BIC 2756.361 2707.270 2695.242 2673.648 2652.247
## N 1278 1278 1278 1278 1278
## ==============================================================================================
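The adjusted R-squared row penalizes each model for its number of predictors, which is why it sits slightly below R-squared. As a sketch of the arithmetic, the standard formula can be applied to the rounded values printed for m5 (R-squared 0.307, N 1278, 5 predictors); the function name is illustrative.

```r
# Standard adjusted R-squared formula:
# n is the number of observations, p the number of predictors
adjusted_r_squared <- function(r_squared, n, p) {
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

# Values for m5 copied from the table above (already rounded to 3 decimals)
adj_m5 <- adjusted_r_squared(r_squared = 0.307, n = 1278, p = 5)
print(round(adj_m5, 3))
```

This reproduces the 0.304 shown for m5. Because the inputs are rounded, the last digit may differ by one for other columns.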
## Calculate the prediction error on the test data
df <- data.frame(
test_data$quality,
predict(m5, test_data) - test_data$quality)
names(df) <- c("quality", "error")
ggplot(aes(x = quality, y = error), data = df) +
geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## The error is close to 0, which suggests the features above can be used to predict red wine quality
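An overall error near zero can hide errors that cancel out across quality levels, so it can help to summarize the error both overall and per quality score. The sketch below uses invented `actual` and `predicted` vectors, not output from m5, and base R only.

```r
# Hypothetical actual and predicted scores, invented for illustration only
actual    <- c(3, 5, 5, 6, 6, 8)
predicted <- c(4.8, 5.2, 5.1, 5.7, 5.9, 6.5)

# Same sign convention as above: prediction minus actual
error <- predicted - actual

# Overall size of the error, ignoring its sign
rmse <- sqrt(mean(error^2))
mae <- mean(abs(error))
print(c(rmse = rmse, mae = mae))

# Mean error within each actual quality score
mean_error_by_quality <- tapply(error, factor(actual), mean)
print(mean_error_by_quality)
```

In the same spirit, mapping `x = factor(quality)` in the boxplot would draw one box per score and avoid the continuous-x warning.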
# Final Plots and Summary
### Plot One
### Description One
Most wines in this data set have a quality of 5 or 6. There are too few very bad or very good wines to analyze, so this data set may not be well suited to identifying very good red wines.
### Plot Two
## Warning: Removed 67 rows containing missing values (geom_point).
### Description Two
The plot above shows that sulphates and alcohol both affect red wine quality.
### Plot Three
### Description Three
Because there is not enough data for bad or good wines (quality > 7 or < 4), a linear regression on this data set shows very large errors for them.
This project used R to analyze the red wine data and find out which features affect red wine quality:
1. Univariate analysis looked at each feature on its own.
2. Bivariate analysis looked at how individual features affect wine quality.
3. A linear regression on the top 5 features was used to predict quality.
R is very different from Python, and I need more practice to become familiar with its tools. That said, ggplot2 is a very good plotting package and is simpler to use than matplotlib in Python.